
Versioning and Change Management

The Problem: You improve your prompt. It works great in testing. You deploy it. Customer complaints spike.

Why It Happens:
  • Your test set doesn’t cover the real input distribution
  • Edge cases appear in production
  • Model updates can break prompts
Production Pattern: Prompt Registry
# prompts.yaml
prompts:
  contract_summarizer:
    v1:
      created: "2025-01-01"
      prompt: "..."
      status: "deprecated"
      metrics:
        accuracy: 0.78
        avg_cost: 0.12
    
    v2:
      created: "2025-01-15"
      prompt: "..."
      status: "active"
      rollout: 100%
      metrics:
        accuracy: 0.92
        avg_cost: 0.08
    
    v3:
      created: "2025-02-01"
      prompt: "..."
      status: "testing"
      rollout: 10%
      metrics:
        accuracy: 0.94  # Looking good!
        avg_cost: 0.09

# Code
def get_prompt(name: str, user_id: str | None = None) -> str:
    """
    Load a prompt with A/B testing support.
    """
    config = load_prompts()[name]

    # Route a slice of traffic to the version under test
    if user_id and should_test(user_id, config["v3"]["rollout"]):
        version = "v3"
    else:
        version = "v2"  # the current active version

    return config[version]["prompt"]
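The `should_test` helper is left undefined above. A minimal sketch, assuming `rollout` has already been parsed into a fraction between 0 and 1 (the registry's `10%` becoming `0.10`), is to hash the user ID so each user lands in a stable bucket and never flips between versions across requests:

```python
import hashlib

def should_test(user_id: str, rollout: float, salt: str = "contract_summarizer_v3") -> bool:
    """Deterministically decide whether this user is in the test group.

    Hashing (rather than random.random()) keeps each user in the same
    group on every request, so their experience stays consistent.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return bucket < rollout
```

The `salt` parameter (a hypothetical addition) keeps bucket assignments independent across experiments: salting with the prompt name and version means a user in the 10% group for one test isn't automatically in the 10% group for every other test.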

Monitoring and Observability

What to Track:
import time
from datetime import datetime

@monitor  # assumes a monitoring decorator supplied by your observability tool
def generate_response(prompt: str, user_id: str) -> str:
    start_time = time.time()
    
    response = llm.generate(prompt)
    
    # Log metrics
    log_metrics({
        "user_id": user_id,
        "prompt_version": get_version(prompt),
        "latency": time.time() - start_time,
        "input_tokens": count_tokens(prompt),
        "output_tokens": count_tokens(response),
        "cost": calculate_cost(prompt, response),
        "timestamp": datetime.now()
    })
    
    # Check for issues
    if detect_hallucination(response):
        alert("Possible hallucination detected")
    
    if detect_prompt_injection(prompt):
        alert("Prompt injection attempt")
    
    return response
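One way to implement the `calculate_cost` helper, sketched here as a variant that takes token counts rather than raw strings. The per-million-token prices are placeholder assumptions, not any provider's real rates:

```python
# Hypothetical per-million-token prices; substitute your provider's actual rates.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00

def calculate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in dollars from input/output token counts."""
    return (input_tokens * PRICE_PER_M_INPUT
            + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
```

Logging this per request is what makes the "Cost per request" dashboard metric and budget alerts below possible.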
Dashboard Metrics:
  • Requests per minute
  • Average latency (p50, p95, p99)
  • Cost per request
  • Error rate
  • User satisfaction (thumbs up/down)
  • Cache hit rate
  • Model distribution (if cascading)
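The latency percentiles above (p50, p95, p99) can be computed from raw samples with the standard library; a minimal sketch:

```python
import statistics

def latency_percentiles(samples: list[float]) -> dict[str, float]:
    """Compute p50/p95/p99 from raw latency samples (in seconds)."""
    qs = statistics.quantiles(samples, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```

Tracking p95/p99 alongside p50 matters because LLM latency is long-tailed: the median can look fine while a meaningful fraction of users wait far longer.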
AI Observability Tools: Several tools can help you implement comprehensive monitoring:
  • Open Source: Phoenix, LangFuse, Opik
  • Commercial: Arize, AgentOps
  • Product-Specific: LangSmith (for LangChain and LangGraph applications)
These tools provide features like prompt versioning, cost tracking, latency monitoring, and quality metrics out of the box.

The Production Checklist

Before deploying any LLM feature:

Testing:
  • Evaluation dataset created (100+ examples)
  • Accuracy meets requirements (>90%)
  • Edge cases tested
  • Failure modes documented
Cost:
  • Cost per request measured
  • Caching implemented where possible
  • Model selection optimized
  • Budget alerts configured
Safety:
  • Input validation in place
  • Output validation in place
  • Prompt injection defenses tested
  • Fallback behavior defined
Observability:
  • Metrics logging configured
  • Alerts set up
  • Dashboard created
  • On-call runbook written
Deployment:
  • A/B testing framework ready
  • Rollout plan defined (10% → 50% → 100%)
  • Rollback procedure documented
  • Customer communication prepared
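The staged rollout plan (10% → 50% → 100%) can be sketched as a small controller: promote to the next stage only while the observed error rate stays healthy, and roll back entirely on a breach. The stage values and error threshold here are illustrative assumptions:

```python
STAGES = [10, 50, 100]  # rollout percentages from the plan above

def next_rollout(current: int, error_rate: float, threshold: float = 0.02) -> int:
    """Return the next rollout percentage for the candidate version.

    Promotes one stage at a time while error_rate stays under the
    threshold; any breach rolls back to 0% so the previous version
    takes all traffic again.
    """
    if error_rate >= threshold:
        return 0  # rollback: route everyone to the previous version
    idx = STAGES.index(current)
    return STAGES[min(idx + 1, len(STAGES) - 1)]
```

In practice each promotion should also wait for enough traffic at the current stage to make the error rate statistically meaningful before advancing.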